BigQueryIO uniformize direct and export reads #32360

RustedBones · 2024-08-29T13:13:53Z

Refers to #26329; also fixes #20100 and #21076.

When using readWithDatumReader with the DIRECT_READ method, the transform would fail because a parseFn is expected. Refactor the IO so the Avro DatumReader can be used in both cases.

In some cases, the data must be read with a desired schema. Currently, BigQueryIO always uses the writer schema (the table schema). Create new APIs to set the reader schema.
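To illustrate why a reader schema matters: under Avro's schema resolution rules, writer fields that the reader schema does not declare are ignored, and reader fields missing from the written data take the reader's default (or fail if there is none). The sketch below mimics only that record-field rule with plain maps. It does not use the Avro library, and `ReaderSchemaProjection` and `project` are hypothetical names, not APIs from this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReaderSchemaProjection {
  /**
   * Projects a record written with the writer schema onto the reader schema.
   * readerFields maps each reader field name to its default value; a null
   * default means the field has no default (an assumption of this sketch).
   */
  static Map<String, Object> project(
      Map<String, Object> writerRecord, Map<String, Object> readerFields) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> field : readerFields.entrySet()) {
      String name = field.getKey();
      if (writerRecord.containsKey(name)) {
        // Field exists in the written data: keep the written value.
        out.put(name, writerRecord.get(name));
      } else if (field.getValue() != null) {
        // Field is missing from the written data: fall back to the reader default.
        out.put(name, field.getValue());
      } else {
        throw new IllegalStateException("No value and no default for field: " + name);
      }
    }
    // Writer fields not named by the reader schema are dropped implicitly.
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("id", 1L);
    row.put("name", "a");
    row.put("extra", true);

    Map<String, Object> reader = new LinkedHashMap<>();
    reader.put("id", null);            // required, no default
    reader.put("country", "unknown");  // absent in the table, has a default

    System.out.println(project(row, reader));
  }
}
```

With only the writer schema available, the caller would receive `name` and `extra` as well and would have no `country` field at all.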

~~This refactoring contains some breaking changes:~~

withFormat is no longer exposed: it is not possible to configure a TypedRead with a DatumReaderFactory and change the format later. The data format MUST be chosen when creating the transform.

In TypedRead.Builder, the DatumReaderFactory is replaced with a BigQueryReaderFactory, which handles both Avro and Arrow in a uniform fashion. This alters BigQueryIOTranslation.
I need some help to handle that point in a better way.
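As a rough picture of the idea, here is a dependency-free sketch of a single factory type that covers both formats. Every name in it (`WireFormat`, `RowReaderFactory`, `forAvro`, `forArrow`) is hypothetical and is not the class introduced by this PR; decoding is stubbed as UTF-8 text. It only shows the shape: the format is fixed when the factory is created, which is why changing it afterwards is not possible.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for the supported data formats.
enum WireFormat { AVRO, ARROW }

// Hypothetical unified factory: the same type is used whatever the format is.
final class RowReaderFactory<T> {
  private final WireFormat format;
  private final Function<String, T> parseFn;

  private RowReaderFactory(WireFormat format, Function<String, T> parseFn) {
    this.format = format;
    this.parseFn = parseFn;
  }

  // The format is chosen here, together with the parsing function.
  static <T> RowReaderFactory<T> forAvro(Function<String, T> parseFn) {
    return new RowReaderFactory<>(WireFormat.AVRO, parseFn);
  }

  static <T> RowReaderFactory<T> forArrow(Function<String, T> parseFn) {
    return new RowReaderFactory<>(WireFormat.ARROW, parseFn);
  }

  WireFormat format() {
    return format;
  }

  T read(byte[] payload) {
    // Real decoding would depend on the format; here it is stubbed as tagged text.
    String decoded =
        format.name().toLowerCase(Locale.ROOT) + ":" + new String(payload, StandardCharsets.UTF_8);
    return parseFn.apply(decoded);
  }
}

final class ReaderFactoryDemo {
  public static void main(String[] args) {
    RowReaderFactory<String> avro = RowReaderFactory.forAvro(s -> s + "!");
    RowReaderFactory<Integer> arrow = RowReaderFactory.forArrow(String::length);
    byte[] payload = "ab".getBytes(StandardCharsets.UTF_8);
    System.out.println(avro.format() + " -> " + avro.read(payload));
    System.out.println(arrow.format() + " -> " + arrow.read(payload));
  }
}
```

Because the factory has no setter for the format, the export and direct read paths can both ask the same object which format to request and how to decode it.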

Edit: reworked part of this PR to keep compatibility

github-actions · 2024-08-29T14:36:02Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

RustedBones · 2024-08-30T11:37:50Z

assign set of reviewers

github-actions · 2024-08-30T11:38:59Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @ahmedabu98 for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

RustedBones · 2024-08-30T11:40:49Z

Some BQ integration tests are failing.

I don't know the schema and data of the big_query_import_export.parallel_read_table_row_xxx tables, so I can't recreate the setup in a personal GCP project. Can someone give me a hand?

RustedBones · 2024-09-04T08:14:32Z

...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java

+      // read table schema and infer coder if possible
+      Coder<T> c;
+      if (getCoder() == null) {
+        tableSchema = requestTableSchema(sourceDef, bqOptions, getSelectedFields());


Is it fine to access the BQ table at graph creation time? (It was already doing that when the Beam schema was requested.)

Abacn · 2024-09-16

Yeah, this is a valid concern. I've heard of use cases where the pipeline submission machine has no or incomplete permission to the resource, and inferring the schema at graph creation time can cause issues. The general guideline is that use cases that used to work should still work (and vice versa).
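One way to respect that guideline is to contact the table only when the caller has not already supplied what the lookup would provide. The sketch below is plain Java under assumed names (`SchemaResolver`, `fetchSchema`, `LazySchemaDemo`); it is not the PR's code and only illustrates the "call table schema only if required" idea, with strings standing in for coders and schemas.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class SchemaResolver {
  private final String explicitCoder;         // null when the user set no coder
  private final Supplier<String> fetchSchema; // hypothetical remote table lookup

  SchemaResolver(String explicitCoder, Supplier<String> fetchSchema) {
    this.explicitCoder = explicitCoder;
    this.fetchSchema = fetchSchema;
  }

  String resolveCoder() {
    if (explicitCoder != null) {
      // Nothing to infer, so the table is never contacted.
      return explicitCoder;
    }
    // Only this branch needs read permission on the table at graph construction.
    return "coder-for(" + fetchSchema.get() + ")";
  }
}

final class LazySchemaDemo {
  // Returns how many remote lookups happened while resolving the coder.
  static int lookupsFor(String explicitCoder) {
    AtomicInteger calls = new AtomicInteger();
    SchemaResolver resolver =
        new SchemaResolver(
            explicitCoder,
            () -> {
              calls.incrementAndGet();
              return "id:INTEGER";
            });
    resolver.resolveCoder();
    return calls.get();
  }

  public static void main(String[] args) {
    System.out.println("explicit coder lookups: " + lookupsFor("my-coder"));
    System.out.println("inferred coder lookups: " + lookupsFor(null));
  }
}
```

A submission machine without table access would then still work, provided the pipeline sets its coder explicitly.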

RustedBones · 2024-09-04T08:16:57Z

...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java

@@ -1731,7 +1870,7 @@ public void processElement(ProcessContext c) throws Exception {
                                          .setTable(
                                              BigQueryHelpers.toTableResourceName(
                                                  queryResultTable.getTableReference()))
-                                          .setDataFormat(DataFormat.AVRO))


Was Arrow even supported?

...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryReaderFactory.java

...ogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceDef.java

...ogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java

...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOTranslation.java

RustedBones · 2024-09-04T08:44:37Z

...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOTranslation.java

@@ -89,8 +95,8 @@ static class BigQueryIOReadTranslator implements TransformPayloadTranslator<Type
            .addNullableBooleanField("use_legacy_sql")
            .addNullableBooleanField("with_template_compatibility")
            .addNullableByteArrayField("bigquery_services")
-            .addNullableByteArrayField("parse_fn")
-            .addNullableByteArrayField("datum_reader_factory")
+            .addNullableByteArrayField("bigquery_reader_factory")


This is a complex object to serialize. It is subject to serialization errors if there are changes between versions.
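A common way to reduce that risk is to translate the configuration into simple fields and rebuild the object on the other side, instead of storing a serialized object graph. The following is a minimal sketch under assumptions: `ReadConfig`, the field names `format` and `reader_schema_json`, and the fallback to `AVRO` when the format is absent are all hypothetical and are not taken from BigQueryIOTranslation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

final class ReadConfig {
  final String format;           // e.g. "AVRO" or "ARROW"
  final String readerSchemaJson; // schema kept as JSON text; null when unset

  ReadConfig(String format, String readerSchemaJson) {
    this.format = Objects.requireNonNull(format);
    this.readerSchemaJson = readerSchemaJson;
  }

  // Translate to simple string fields rather than Java-serialized bytes.
  Map<String, String> toFields() {
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("format", format);
    if (readerSchemaJson != null) {
      fields.put("reader_schema_json", readerSchemaJson);
    }
    return fields;
  }

  // Rebuild from fields; a payload written by an older version may lack either one.
  static ReadConfig fromFields(Map<String, String> fields) {
    return new ReadConfig(
        fields.getOrDefault("format", "AVRO"), // assumed fallback for old payloads
        fields.get("reader_schema_json"));
  }

  public static void main(String[] args) {
    ReadConfig original = new ReadConfig("ARROW", "{\"type\":\"record\"}");
    ReadConfig copy = fromFields(original.toFields());
    System.out.println(copy.format + " " + copy.readerSchemaJson);
  }
}
```

Adding or renaming a class inside the factory no longer changes the payload, and missing fields degrade to defaults rather than failing at deserialization.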

...ogle-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java

github-actions · 2024-09-11T12:14:19Z

Reminder, please take a look at this pr: @robertwb @ahmedabu98

github-actions · 2024-12-04T12:14:55Z

Reminder, please take a look at this pr: @kennknowles @ahmedabu98

github-actions · 2024-12-09T12:15:11Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @chamikaramj for label io.

RustedBones · 2024-12-10T13:18:55Z

I did some refactoring, reducing the breaking changes and allowing an easier transform upgrade

It should be possible to read BQ avro data using a provided compatible avro schema for both file and direct read.

Add readRows api
Improve coder inference
Self review
Fix concurrency issue
spotless
checkstyle
Ignore BigQueryIOTranslationTest
Add missing project option to execute test
Call table schema only if required
Fix avro export without logical type
checkstyle
Add back float support
Fix write test
Add arrow support in translation

Reduce breaking changes by configuring IO with simple objects

github-actions · 2024-12-24T12:14:28Z

Reminder, please take a look at this pr: @robertwb @chamikaramj

github-actions · 2024-12-27T12:14:45Z

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @damondouglas for label io.

github-actions · 2025-01-04T12:17:15Z

Reminder, please take a look at this pr: @kennknowles @damondouglas

github-actions bot added java io extensions gcp zetasketch labels Aug 29, 2024

RustedBones force-pushed the bq-read-schema branch from 5dbe389 to 989ccdd Compare August 30, 2024 08:59

github-actions bot added the examples label Aug 30, 2024

github-actions bot added the Next Action: Reviewers label Aug 30, 2024

github-actions bot added examples and removed examples labels Aug 30, 2024

RustedBones marked this pull request as draft September 3, 2024 07:18

github-actions bot added examples and removed examples labels Sep 3, 2024

RustedBones marked this pull request as ready for review September 3, 2024 19:48

github-actions bot added examples and removed examples labels Sep 4, 2024

github-actions bot added the slow-review label Sep 11, 2024

github-actions bot added the slow-review label Dec 4, 2024

github-actions bot added sql kotlin and removed slow-review sql kotlin labels Dec 9, 2024

RustedBones force-pushed the bq-read-schema branch from adc31e2 to a7a81be Compare December 10, 2024 09:55

github-actions bot added kotlin and removed kotlin labels Dec 10, 2024

RustedBones added 5 commits December 16, 2024 17:57

Update translation IO

a92207b

Remove avro parameter from generic interface

7858658

Reduce breaking changes

3c5efbf

Reduce breaking changes by configuring IO with simple objects

Handle non serializable avro Schema

afab5e0

RustedBones force-pushed the bq-read-schema branch from a7a81be to c73e4e0 Compare December 16, 2024 17:23

github-actions bot added kotlin and removed kotlin labels Dec 16, 2024

Consistent naming with AvroIO

1c52b2a

RustedBones force-pushed the bq-read-schema branch from c73e4e0 to 1c52b2a Compare December 17, 2024 10:10

github-actions bot added kotlin and removed kotlin labels Dec 17, 2024

github-actions bot added the slow-review label Dec 24, 2024

github-actions bot removed the slow-review label Dec 27, 2024

github-actions bot added the slow-review label Jan 4, 2025

Abacn Sep 16, 2024
